Audio User Interaction Recognition and Context Refinement

ABSTRACT

A system which performs social interaction analysis for a plurality of participants includes a processor. The processor is configured to determine a similarity between a first spatially filtered output and each of a plurality of second spatially filtered outputs. The processor is configured to determine the social interaction between the participants based on the similarities between the first spatially filtered output and each of the second spatially filtered outputs and display an output that is representative of the social interaction between the participants. The first spatially filtered output is received from a fixed microphone array, and the second spatially filtered outputs are received from a plurality of steerable microphone arrays each corresponding to a different participant.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. §119(e) to Provisional Patent Application No. 61/645,818, filed May 11, 2012. This provisional patent application is hereby expressly incorporated by reference herein in its entirety.

BACKGROUND

A substantial amount of useful information can be derived from determining the direction a user is looking at different points in time, and this information can be used to enhance the user's interaction with a variety of computational systems. Therefore, it is not surprising that a vast amount of gaze tracking research using a vision based approach (i.e., tracking the eyes using any of several various means) has already been undertaken. However, understanding a user's gazing direction only gives semantic information on one dimension of the user's interest and does not take into account contextual information that is mostly given by speech. In other words, the combination of gaze tracking coupled with speech tracking would provide richer and more meaningful information in a variety of different user applications.

SUMMARY

Contextual information (that is, non-visual information that is being sent or received by a user) is determined using an audio based approach. Audio user interaction on the receiving side may be enhanced by steering audio beams toward a specific person or a specific sound source. The techniques described herein may therefore allow a user to more clearly understand the context of a conversation, for example. To achieve these benefits, inputs from one or more steerable microphone arrays and inputs from a fixed microphone array may be used to determine who a person is looking at or what a person is paying attention to relative to who is speaking where audio-based contextual information (or even visual-based semantic information) is being presented.

For various implementations, two different types of microphone array devices (MADs) are used. The first type of MAD is a steerable microphone array (also referred to herein as a steerable array) which is worn by a user in a known orientation with regard to the user's eyes, and multiple users may each wear a steerable array. The second type of MAD is a fixed-location microphone array (also referred to herein as a fixed array) which is placed in the same acoustic space as the users (one or more of which are using steerable arrays).

For certain implementations, the steerable microphone array may be part of an active noise control (ANC) headset or hearing aid. There may be multiple steerable arrays, each associated with a different user or speaker (also referred to herein as a participant) in a meeting or group, for example. The fixed microphone array, in such a context, would then be used to separate different people speaking and listening during the group meeting using audio beams corresponding to the direction in which the different people are located relative to the fixed array.

The correlation or similarity between the audio beams of the separated speakers of the fixed array and the outputs of the steerable arrays is evaluated. Correlation is one example of a similarity measure, although any of several similarity measurement or determination techniques may be used.

In an implementation, the similarity measure between the audio beams of the separated participants of the fixed array and the outputs of the steerable arrays may be used to track social interaction between participants, including gazing direction of the participants over time as different participants speak or present audio-based information.

In an implementation, the similarity measure between the audio beams of the separated participants of the fixed array and the outputs of the steerable arrays may be used to zoom in on a targeted participant, for example. This zooming might in turn lead to enhanced noise filtering and amplification when one user (who at that moment is a listener) is gazing at another person who is providing audio-based information (i.e., speaking).

In an implementation, the similarity measure between the audio beams of the separated participants of the fixed array and the outputs of the steerable arrays may be used to adaptively form a better beam for a targeted participant, in effect better determining the physical orientation of each of the users relative to each other.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing summary, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the embodiments, there are shown in the drawings example constructions of the embodiments; however, the embodiments are not limited to the specific methods and instrumentalities disclosed. In the drawings:

FIG. 1 is a diagram of a group of users each wearing a steerable microphone array, along with a fixed microphone array, that may be used to determine contextual information;

FIG. 2 is an operational flow of an implementation of a method of determining user interaction using steerable microphone arrays and a fixed microphone array;

FIG. 3 is an operational flow of another implementation of a method of determining user interaction using steerable microphone arrays and a fixed microphone array;

FIG. 4 is a diagram of an example display that may provide an indication of a user identity and which direction the user is looking;

FIG. 5 is a diagram of a user interface that may be generated and displayed and that indicates various user interactions and meeting data;

FIG. 6 is a diagram of an example display of a user interface that may be generated and displayed (e.g., on a smartphone display) and that indicates various user interactions (e.g., during a meeting);

FIG. 7 is a diagram of an example display that indicates various user interactions with respect to various topics;

FIG. 8 is a diagram of an example display that indicates various user interactions over time;

FIG. 9 is a diagram of another example display that indicates various user interactions over time;

FIG. 10 is an operational flow of an implementation of a method of measuring similarity using cross-correlation;

FIG. 11 is an operational flow of an implementation of a method of measuring similarity using cross-cumulant;

FIG. 12 is an operational flow of an implementation of a method of measuring similarity using time-domain least squares fit;

FIG. 13 is an operational flow of an implementation of a method of measuring similarity using frequency-domain least squares fit;

FIG. 14 is an operational flow of an implementation of a method of measuring similarity using Itakura-Saito distance;

FIG. 15 is an operational flow of an implementation of a method of measuring similarity using a feature based approach;

FIG. 16 shows an example user interface display;

FIG. 17 shows an exemplary user interface display to show collaborative zooming on the display;

FIG. 18 is an operational flow of an implementation of a method for zooming into a target participant;

FIG. 19 shows an example user interface display with additional candidate look directions;

FIG. 20 is an operational flow of an implementation of a method for adaptively refining beams for a targeted speaker;

FIG. 21 shows a far-field model of plane wave propagation relative to a microphone pair;

FIG. 22 shows multiple microphone pairs in a linear array;

FIG. 23 shows plots of unwrapped phase delay vs. frequency for four different DOAs, and FIG. 24 shows plots of wrapped phase delay vs. frequency for the same DOAs;

FIG. 25 shows an example of measured phase delay values and calculated values for two DOA candidates;

FIG. 26 shows a linear array of microphones arranged along the top margin of a television screen;

FIG. 27 shows an example of calculating DOA differences for a frame;

FIG. 28 shows an example of calculating a DOA estimate;

FIG. 29 shows an example of identifying a DOA estimate for each frequency;

FIG. 30 shows an example of using calculated likelihoods to identify a best microphone pair and best DOA candidate for a given frequency;

FIG. 31 shows an example of likelihood calculation;

FIG. 32 shows an example of a speakerphone application;

FIG. 33 shows a mapping of pair-wise DOA estimates to a 360° range in the plane of the microphone array;

FIGS. 34 and 35 show an ambiguity in the DOA estimate;

FIG. 36 shows a relation between signs of observed DOAs and quadrants of an x-y plane;

FIGS. 37-40 show an example in which the source is located above the plane of the microphones;

FIG. 41 shows an example of microphone pairs along non-orthogonal axes;

FIG. 42 shows an example of use of the array of FIG. 41 to obtain a DOA estimate with respect to the orthogonal x and y axes;

FIGS. 43 and 44 show examples of pair-wise normalized beamformer/null beamformers (BFNFs) for a two-pair microphone array (e.g., as shown in FIG. 45);

FIG. 46 shows an example of a pair-wise normalized minimum variance distortionless response (MVDR) BFNF;

FIG. 47 shows an example of a pair-wise BFNF for frequencies in which the matrix A^(H)A is not ill-conditioned;

FIG. 48 shows examples of steering vectors; and

FIG. 49 shows a flowchart of an integrated method of source direction estimation as described herein.

DETAILED DESCRIPTION

Unless expressly limited by its context, the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device), and/or retrieving (e.g., from an array of storage elements). Unless expressly limited by its context, the term “selecting” is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “derived from” (e.g., “B is a precursor of A”), (ii) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (iii) “equal to” (e.g., “A is equal to B” or “A is the same as B”). Similarly, the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least.”

References to a “location” of a microphone of a multi-microphone audio sensing device indicate the location of the center of an acoustically sensitive face of the microphone, unless otherwise indicated by the context. The term “channel” is used at times to indicate a signal path and at other times to indicate a signal carried by such a path, according to the particular context. Unless otherwise indicated, the term “series” is used to indicate a sequence of two or more items. The term “logarithm” is used to indicate the base-ten logarithm, although extensions of such an operation to other bases are within the scope of this disclosure. The term “frequency component” is used to indicate one among a set of frequencies or frequency bands of a signal, such as a sample (or “bin”) of a frequency domain representation of the signal (e.g., as produced by a fast Fourier transform) or a subband of the signal (e.g., a Bark scale or mel scale subband).

Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa). The term “configuration” may be used in reference to a method, apparatus, and/or system as indicated by its particular context. The terms “method,” “process,” “procedure,” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context. The terms “apparatus” and “device” are also used generically and interchangeably unless otherwise indicated by the particular context. The terms “element” and “module” are typically used to indicate a portion of a greater configuration. Unless expressly limited by its context, the term “system” is used herein to indicate any of its ordinary meanings, including “a group of elements that interact to serve a common purpose.”

Any incorporation by reference of a portion of a document shall also be understood to incorporate definitions of terms or variables that are referenced within the portion, where such definitions appear elsewhere in the document, as well as any figures referenced in the incorporated portion. Unless initially introduced by a definite article, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify a claim element does not by itself indicate any priority or order of the claim element with respect to another, but rather merely distinguishes the claim element from another claim element having a same name (but for use of the ordinal term). Unless expressly limited by its context, each of the terms “plurality” and “set” is used herein to indicate an integer quantity that is greater than one.

A combination visual- and hearing-based approach is described herein to enable a user to steer towards a person (or a sound source) in order to more clearly understand the audio-based information being presented at that moment (e.g., the context of conversation and/or the identity of the sound source) using sound sensors and a variety of position-based calculations and resulting interaction enhancements.

For example, the correlation or similarity between the audio beams of the separated speakers of the fixed array and the outputs of the steerable arrays may be used to track social interaction between speakers. Correlation is just one example of a similarity measure, and any similarity measurement or determination technique may be used.

More particularly, a social interaction or social networking analysis of a group of users (also referred to herein as speakers or participants) may be performed and displayed using a connection graph generated responsive to the correlation or other similarity measure between the audio beams of the separated speakers of the fixed array and the output of each steerable array respectively associated with each user of the group. Thus, for example, automatic social network analysis may be performed in a group meeting of participants, using a connection graph among the meeting participants, to derive useful information regarding who was actively engaged in the presentation or more generally the effectiveness of the presentation in holding the attention of the users.

FIG. 1 is a diagram 100 of a group of users each wearing a steerable microphone array 110, along with a fixed-location microphone array 150 in the same space (e.g., room) as the users, which may be used to determine contextual information. As shown in FIG. 1, each user 105 of a group of users in a room (or other defined space) wears a steerable microphone array (e.g., as a headset that may include the ability to perform active noise control (ANC)), and a fixed-location microphone array 150 is located in the room (e.g., on a table, in a phone, etc.). The fixed-location microphone array 150 may be part of an electronic device such as a video game platform, tablet, notebook, or smartphone, for example, or may be a standalone device or implementation. Alternatively or additionally, the fixed-location microphone array 150 may comprise a distributed microphone array (i.e., distributed microphones).

A user 105 wearing the headset may generate a fixed beam-pattern 120 from his steerable (e.g., wearable) microphone array which is pointed in the user's physical visual (or “look”) direction. If the user turns his head, then the look direction of the beam-pattern also changes. The active speaker's location may be determined using the fixed microphone array. By correlating, or otherwise determining the similarity of, beamformed output (or any type of spatially filtered output) from the steerable microphone array with the fixed microphone array outputs corresponding to each active speaker, the identity of the person that a user is looking at (e.g., paying attention to, listening to, etc.) may be determined. Each headset may have a processor that is in communication (e.g., via a wireless communications link) with a main processor (e.g., in a centralized local or remote computing device) to analyze correlations or similarities of beams between the headsets and/or the fixed arrays.

In other words, fixed beam patterns at any moment in time may be formed based on a user's physical look direction which can be correlated with the fixed microphone array outputs, thereby providing a visual indication, via a connection graph 130 (e.g., displayed on a display of any type of computing device, such as a handset, a laptop, a tablet, a computer, a netbook, or a mobile computing device), of the social interaction of the targeted users. Thus, by correlating a beamformed output from the steerable microphone array with the fixed microphone array outputs corresponding to each active speaking user, tracking of a social interaction or network analysis may be performed and displayed. Moreover, by checking the similarity between beamformed output from the look-direction-steerable microphone array and the location-fixed microphone array outputs corresponding to each active speaker, the person that a user is looking at or paying attention to can be identified and zoomed into.

FIG. 2 is an operational flow of an implementation of a method 200 of determining user interaction using steerable microphone arrays and a fixed microphone array. At 210, the steerable microphone arrays and the fixed microphone array each receive sound at roughly the same time (although small variations can be detected and used to calculate relative positions of the user). At 220, a spatially filtered output, such as a beamformed output, is generated by each of the steerable microphone arrays and the fixed microphone array. At 230, the spatially filtered output of each steerable microphone array is compared with the spatially filtered output of the fixed microphone array. Any known technique for determining similarity or correlation may be used. At 240, the similarity or correlation information obtained from 230 may be used to determine and/or display user interaction information, as described further herein.
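
The comparison at 230 may be sketched in a few lines of code. The following is a minimal illustration only, assuming each array delivers one spatially filtered signal per frame as a NumPy array; the function names and data layout are hypothetical and not part of the described system.

    import numpy as np

    def similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Normalized cross-correlation peak; any similarity measure may be substituted.
        a = (a - a.mean()) / (a.std() + 1e-12)
        b = (b - b.mean()) / (b.std() + 1e-12)
        c = np.correlate(a, b, mode="full") / len(a)
        return float(np.max(np.abs(c)))

    def best_matching_speaker(steerable_out: np.ndarray, fixed_beams: dict) -> int:
        # Compare a user's look-direction beam with each fixed-array beam
        # (one per separated active speaker); return the best-matching speaker ID.
        scores = {spk: similarity(steerable_out, beam)
                  for spk, beam in fixed_beams.items()}
        return max(scores, key=scores.get)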

FIG. 3 is an operational flow of another implementation of a method 300 of determining user interaction using steerable microphone arrays and a fixed-location microphone array. Each of a plurality of users has a steerable stereo microphone array, such as an ANC headset, that has a known orientation corresponding to the visual gazing direction of each such user. Each of the steerable arrays (in the ANC headsets) provides fixed broadside beamforming at 305, in which a beamformed output (or any type of spatially filtered output) is generated in the user look direction at 310 (i.e., in the direction the user of the steerable array is looking).

A fixed microphone array (such as in a smartphone) with an associated processor performs a direction of arrival (DOA) estimation at 320 in three dimensions (3D) around the fixed microphone array and separates the active speakers at 325. The number of active speakers is determined at 370, and a separate output for each active speaker (identified by an identification number for example) is generated at 380. In an implementation, speaker recognition and labeling of the active speakers may be performed at 330.

The similarity is measured between the separated speakers of the fixed array and the outputs of the steerable arrays at 340. Using the measured similarity and the DOA estimation and the speaker IDs, a visualization of the user interaction (with speaker identity (ID) or participant ID) may be generated and displayed at 350. Each user's look direction may be provided to the fixed array as a smartphone coordinate for example, at 360.

A connection graph (also referred to as an interaction graph) may be generated which displays (a) who is talking and/or listening to whom and/or looking at whom, (b) who is dominating and/or leading the discussion of the group, and/or (c) who is bored, not participating, and/or quiet, for example. Real-time meeting analysis may be performed to assist the efficiency of the meeting and future meetings. Information such as time of meeting, place (e.g., meeting location), speaker identity or participant identity, meeting topic or subject matter, and number of participants, for example, may be displayed and used in the analysis.

FIG. 4 is a diagram 400 of an example display 403 that may provide an indication of a user identity and which direction the user is looking. The user identity (participant ID 406) is displayed along with the direction that the user is looking (participant look direction 410). During a meeting, for example, this display of the participant look direction 410 may be generated and provided to an interested party, such as a meeting administrator or leader or supervisor, so that the interested party may see who the participant is looking at at various times during the meeting. Although only one participant ID 406 and participant look direction 410 are shown on the display 403, this is not intended to be limiting. The interested party may receive such information for more than one participant, and such information may be displayed concurrently on one or more displays depending on the implementation. The data that is generated for display on the display 403 may be stored in a memory and retrieved and displayed at a later time, as well as being displayed in real-time.

FIG. 5 is a diagram 415 of a user interface that may be generated and displayed on a display 418 and that indicates various user interactions and meeting data. Various types of information may be generated and displayed (e.g., in real-time during a meeting), such as the identifier (ID) of the participant who is talking 420, the ID of the participant(s) that is listening 422, and/or the ID of the participant(s) that is not participating 424 (e.g., not listening at the moment, not listening for more than a predetermined amount of time or for at least a percentage of the meeting, looking somewhere other than the participant who is talking or looking in another predetermined location or direction, etc.). During a meeting, for example, this display 418 may be generated and provided to an interested party, such as a meeting administrator or leader or supervisor.

Additional data may be displayed on the display 418, such as the meeting time 426, the meeting location 428, the length of the meeting 430 (i.e., the duration), the meeting topic 432, and the number of meeting participants 434. Some or all of this data may be displayed. Additionally or alternatively, other data may be displayed, depending on the implementation, such as the IDs of all the participants and other statistics that may be generated as described further herein. The information and data that is generated for display on the display 418 may be stored in a memory and retrieved and displayed at a later time, as well as being displayed in real-time.

It is noted that a participant will be participating even if she is just listening at the meeting (and not speaking) because that participant's microphone (steerable microphone array) will still be picking up the sounds in the direction she is viewing while she is listening. Thus, even if a participant does not speak, there will still be sounds to analyze that are associated with her listening.

A user interface may be generated and displayed (e.g., on a smartphone display or other computing device display such as a display associated with a handset, a laptop, a tablet, a computer, a netbook, or a mobile computing device) that indicates the various user interactions during the meeting. FIG. 6 is a diagram of an example display of a user interface 440 that may be generated and displayed (e.g., on a smartphone display 443) and that indicates various user interactions (e.g., during a meeting). In this example, the direction of each arrow 454 indicates who is looking at whom (only one arrow 454 is shown in this example, though a plurality of such arrows may be shown depending on the implementation and user interactions at a particular time). The thickness of each arrow indicates relatively how strong the interaction is (e.g., based on connected time, etc.). No arrow from or to a person indicates that the user is not involved in the group meeting. A percentage number may be displayed for a user which indicates a participation rate for the group meeting. An indicator 448 may be displayed to identify the leader of the meeting, and percentages 450, 452 may be determined and displayed to show how much of the discussion is directed to a person, and how much of the discussion is directed from the person, respectively. In an implementation, a color or highlighting may be used to indicate the leader of a group of participants.

In the example of FIG. 6, John and Mark are interacting a lot, as indicated by the relatively big thick arrow 446. Mary is being quiet. Real-time meeting analysis (such as that described above with respect to FIGS. 4 and 5, and elsewhere herein) may be performed to assist the efficiency of the meeting. For example, because it looks like Mary is out of the conversation, John may encourage Mary to participate (e.g., by asking a question of Mary).

Social interaction plots may be accumulated over a time period (e.g., over a month, a year, etc.) to assess group dynamics or topic dynamics, for example. FIG. 7 is a diagram 460 of an example display 462 that indicates various user interactions with respect to various topics 464. This information may be captured during one or more meetings, stored in a memory (or multiple memories), and displayed in one or more formats at a later time, e.g., during a historical analysis of data. Here, each participant ID 466 is listed along with their participation rates 468 for the various topics 464.

Thus, for example, Jane has a 20% participation rate in meetings about “Design”, a 40% participation rate in meetings about “Code Walkthrough”, and a 10% participation rate in meetings about “Documentation”. This data may be used to determine which participants are most suited for, or interested in, a particular topic, for example, or which participants may need more encouragement with respect to a particular topic. Participation rates may be determined and based on one or more data items described herein, such as amount of time speaking at the meeting, amount of time paying attention at the meeting, amount of time listening at the meeting, etc. Although percentages are shown in FIG. 7, any relative measuring, numbering, or indicating system or technique may be used to identify relative strengths and/or weaknesses in participating levels or rates.

An “L” in the diagram 460 is used as an example indicator to indicate which user participated most in a certain topic, thereby indicating a potential leader for that topic for example. Any indicator may be used, such as a color, highlighting, or a particular symbol. In this example, John is the most participating in Design, Jane is the most participating in Code Walkthrough, and Mary is the most participating in Documentation. Accordingly, they may be identified as potential leaders in the respective topics.

Additionally, a personal time line with an interaction history may be generated for one or more meeting participants. Thus, not only may a single snapshot or period of time during a meeting be captured, analyzed, and information pertaining to it displayed (either in real-time or later offline), but also history over time may be stored (e.g., in a memory of a computing device such as a smartphone or any type of computing device, such as a handset, a laptop, a tablet, a computer, a netbook, or a mobile computing device), analyzed, and displayed (e.g., in a calendar or other display of a computing device such as a smartphone or any type of computing device, such as a handset, a laptop, a tablet, a computer, a netbook, or a mobile computing device).

FIG. 8 is a diagram 470 of an example display 472 that indicates various user interactions over time that may be used for historical analysis, e.g., after one or more meetings. Here, a user identifier 474 is provided, along with information such as the meeting date and the meeting topic. The information 478 on this display 472 is provided over time 476. It shows information 478, for each period or instant of time, such as who the user was looking at during that period or instant of time, whether the user was speaking then, and the percentage of meeting participants that were looking at the user at that period or instant of time. This information 478 can be determined at predetermined times during a meeting (e.g., every minute, every 5 minutes, etc.), or determined as an average or other weighted determination over particular periods of time, for example. This information is provided as an example only and is not meant to be limiting; additional or alternative information can be generated and displayed as information 478.

The information displayed in FIG. 8 can be used for meeting analysis and user analysis. Thus, in FIG. 8, it may be determined that the user Jane typically looks at Mary or Mark when Jane is not speaking, but Jane looks at John when Jane is speaking. FIG. 8 also indicates that when Jane is not speaking, the percentage of participants looking at Jane is zero, but this percentage increases as Jane is speaking.

Interaction statistics may also be generated, stored, analyzed, and displayed. For example, the evolution of interaction between people can be tracked and displayed. Recursive weighting over time may be used (e.g., 0.9*historical data + 0.1*current data), such that as data gets older, it becomes less relevant, with the most current data being weighted the highest (or vice versa). In this manner, a user may be able to see which people he or others are networking with more than others. Additional statistics may be factored into the analysis to provide more accurate interaction information. For example, interaction information obtained from email exchanges or other communication may be combined with the meeting, history, and/or participant interaction data to provide additional (e.g., more accurate) interaction information.
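
As a minimal sketch of the recursive weighting just mentioned (the function and parameter names are hypothetical, assuming interaction strength is tracked as a single number per pair of participants):

    def update_interaction(history: float, current: float, decay: float = 0.9) -> float:
        # Recursive weighting over time: e.g., 0.9*historical data + 0.1*current data,
        # so that older data gradually becomes less relevant.
        return decay * history + (1.0 - decay) * current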

FIG. 9 is a diagram 480 of another example display 482 that indicates various user interactions over time. Here, a user Jane is identified along with an interaction scale 488 and a time period. The diagram 480 shows other user IDs 484 and a listing of months 486 in the past. The interaction scale in this example ranges from 0 to 10, with 0 representing no interaction and 10 representing a very strong interaction between the identified user and Jane in each of the months 486. This information may be generated and provided as historical data and used, e.g., by a meeting participant or a leader or supervisor to view and analyze the various user interactions over time, e.g., to see who is most strongly interacting with whom, and when.

As another example, online learning monitoring may be performed to determine whether a student in a remote site is actively participating or not. Likewise, an application for video games with participant interaction is also contemplated in which there may be immediate recognition of where the users are looking among the possible sound event locations.

FIG. 10 is an operational flow of an implementation of a method 500 of measuring similarity, which uses cross-correlation as an exemplary measure although any similarity measurement technique may be used. At 503, the fixed microphone array provides a number of active speakers N and the active speakers' separated speech signals. One signal (the sound) is received by the fixed microphone array. The output of the fixed microphone array comprises beams, one beam corresponding to each participant. Thus, a separate output is associated with each participant. At 510, the steerable microphone array provides the user's look direction. For each user, the individual user's output is correlated with each of the beamforms (or other spatially filtered outputs) that are outputted from the fixed microphone array.

Location mapping may be generated using this information, at 515. Information pertaining to when a user turns to someone and looks at them may be leveraged. A well-known classic correlation equation, such as that shown at 506, may be used, where E is the expectation value and c is the correlation value. Whenever there is a maximum peak, that is the angle of strong correlation. In an implementation, the maximum allowable time shift may be predetermined using a physical constraint or system complexity. For example, the time delay between the steerable microphones and the fixed microphones can be measured and used when only the user who wears the steerable array is active. Note that the conventional frame length of 20 ms corresponds to almost 7 meters. The angle θ is the relative angle at which the active speaker is located relative to the listening user. The angle θ may be determined between the fixed array and the steerable array, at 513.
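
A hedged sketch of this lag-limited correlation measure follows, assuming discrete-time signals and a maximum allowable time shift (in samples) fixed in advance by a physical constraint such as room size; the function name is illustrative and not taken from the described system.

    import numpy as np

    def max_correlation(x: np.ndarray, y: np.ndarray, max_shift: int):
        # Peak normalized correlation over lags in [-max_shift, max_shift], and
        # the lag at which it occurs. max_shift may be derived from a physical
        # constraint (e.g., a 20 ms frame spans almost 7 m of sound travel).
        x = x - x.mean()
        y = y - y.mean()
        norm = np.sqrt(np.dot(x, x) * np.dot(y, y)) + 1e-12
        best_val, best_lag = -np.inf, 0
        for lag in range(-max_shift, max_shift + 1):
            v = (np.dot(x[lag:], y[:len(y) - lag]) if lag >= 0
                 else np.dot(x[:lag], y[-lag:])) / norm
            if v > best_val:
                best_val, best_lag = float(v), lag
        return best_val, best_lag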

FIG. 11 is an operational flow of an implementation of a method 520 of measuring similarity, and uses cross-cumulant as an exemplary measure although any similarity measurement technique may be used. The fixed microphone array provides a number of active speakers N and the active speakers' separated speech signals, at 523. One signal (the sound) is received by the fixed microphone array. The output of the fixed microphone array comprises beams, one beam corresponding to each participant. Thus, a separate output is associated with each participant. The steerable microphone array provides the user's look direction, at 530. For each user, the individual user's output is correlated with each of the beamforms (or other spatially filtered output) that is outputted from the fixed microphone array.

Location mapping may be generated using this information, at 525. Information pertaining to when a user turns to someone and looks at them may be leveraged. A well-known classic cumulant equation, shown at 526, may be used, where E is the expectation value and c is the correlation value. Whenever there is a maximum peak, that is the angle of strong correlation. The angle θ is the relative angle at which the active speaker is located relative to the listening user. The angle θ may be determined between the fixed array and the steerable array, at 513.

It is noted that any similarity or correlation technique may be used. Regarding a possible similarity measure, virtually any distance metric(s) may be used, such as, but not limited to, the well-known techniques of: (1) least squares fit with allowable time adjustment: time-domain or frequency-domain; (2) feature based approach: using linear predictive coding (LPC) or mel-frequency cepstral coefficients (MFCC); and (3) higher order based approach: cross-cumulant, empirical Kullback-Leibler divergence, or Itakura-Saito distance.

FIG. 12 is an operational flow of an implementation of a method 540 of measuring similarity using time-domain least squares fit, and FIG. 13 is an operational flow of an implementation of a method 550 of measuring similarity using frequency-domain least squares fit. The method 540, using a time-domain least squares fit, is similar to the method 520 of FIG. 11 described above, except that instead of using the cumulant equation shown at 526, a time domain equation shown at 542 may be used. Similarly, the method 550 is similar to the method 520 of FIG. 11 but, instead of using energy normalization, uses a fast Fourier transform (FFT) in conjunction with the frequency domain equation shown at 552.

FIG. 14 is an operational flow of an implementation of a method 560 of measuring similarity using Itakura-Saito distance. This technique is similar to the FFT technique of FIG. 13, but uses the equation shown at 562. FIG. 15 is an operational flow of an implementation of a method 570 of measuring similarity using a feature based approach. Feature extraction is performed, as shown at 573 and 575, and used in conjunction with the other operations 503, 510, 513, and 515 of FIG. 10, and the equation shown at 572.
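
As one concrete example of a spectral distance from the list above, a minimal sketch of the Itakura-Saito distance follows. This is the standard textbook form rather than a reproduction of the equation shown at 562, assuming P and Q are power spectra (e.g., squared FFT magnitudes) of the two signals being compared; smaller values mean greater similarity.

    import numpy as np

    def itakura_saito(P: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> float:
        # Standard Itakura-Saito distance between two power spectra:
        # sum over bins of (P/Q - log(P/Q) - 1); zero when P == Q.
        r = (P + eps) / (Q + eps)
        return float(np.sum(r - np.log(r) - 1.0))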

In an implementation, the correlation or similarity between the audio beams of the separated speakers of the fixed microphone array and the outputs of the steerable microphone arrays may be used to zoom into a targeted speaker. This type of collaborative zooming may provide a user interface for zooming into a desired speaker.

In other words, collaborative zooming may be performed wherein a user interface is provided for multiple users with multiple devices for zooming into a target speaker by just looking at the target speaker. Beamforming may be produced at the targeted person via either the headsets or handsets such that all available resources of multiple devices can be combined for collaborative zooming, thereby enhancing the look direction of the targeted person.

For example, a user may look at a target person, and beamforming may be produced at the targeted person by either using the headset or a handset (whichever is closer to the target person). This may be achieved by using a device that includes a hidden camera with two microphones. When multiple users of multiple devices look at the target person, the camera(s) can visually focus on the person. In addition, the device(s) can audibly focus (i.e., zoom in on) the person by using (e.g., all) available microphones to enhance the look direction of the target person.

Additionally, the target person can be audibly zoomed in on by nulling out other speakers and enhancing the target person's voice. The enhancement can also be done using a headset or handset, whichever is closer to the target person.

An exemplary user interface display 600 is shown in FIG. 16. The display (e.g., displayed on a smartphone display 610 or other display device) shows the active user location 620 and an associated energy 630. FIG. 17 shows an exemplary user interface display to show collaborative zooming on the display, in which Speaker 1 is zoomed in on as shown in the display 660 from the initial display 650.

FIG. 18 is an operational flow of an implementation of a method 700 for zooming into a target person. As in FIG. 3, a steerable array 705 (in an ANC headset) provides fixed broadside beamforming at 710, in which a beamformed output is generated in the user look direction (i.e., in the direction the user of the steerable array is looking). A fixed microphone array 707 (such as in a smartphone) with an associated processor performs a DOA estimation in three dimensions around the fixed microphone array and separates the active speakers, at 720. The number of active speakers is determined, and a separate output for each active speaker (identified by an identification number for example) is generated.

In an implementation, speaker recognition and labeling of the active speakers may be performed at 730. At 750, a correlation or similarity is determined between the separated speakers of the fixed array and the outputs of the steerable arrays. Using the correlation or similarity measurement and the speakers' IDs, a target user can be detected, localized, and zoomed into, at 760.

The user can be replaced with a device, such as a hidden camera with two microphones, and just by looking at the targeted person, the targeted person can be focused on with zooming by audition as well as by vision.

A camcorder application with multiple devices is contemplated. The look direction is known, and all available microphones of other devices may be used to enhance the look direction source.

In an implementation, the correlation or similarity between the audio beams of the separated speakers of the fixed array and the outputs of the steerable arrays may be used to adaptively form a better beam for a targeted speaker. In this manner, the fixed microphone array's beamformer may be adaptively refined, such that new look directions can be adaptively generated by a fixed beamformer.

For example, the headset microphone array's beamformer output can be used as a reference to refine the look direction of the fixed microphone array's beamformer. The correlation or similarity between the headset beamformer output and the current fixed microphone array beamformer output may be compared with the correlation or similarity between the headset beamformer output and the fixed microphone array beamformer outputs with slightly moved look directions.

FIG. 19 shows an example user interface display 800 with additional candidate look directions 810. By leveraging the correlation or similarity between the headset beamformer output and the original fixed microphone beamformer outputs 820, as shown in FIG. 19, new candidate look directions can be generated by a fixed beamformer. Using this technique, the headset microphone beamformer output can be used as a reference to refine the look direction of the fixed microphone beamformer. For example, speaker 1 in FIG. 19 may be speaking, and as he speaks new candidate look directions can be adaptively formed.

FIG. 20 is an operational flow of an implementation of a method 900 for adaptively refining beams for a targeted speaker. As in FIG. 3, a steerable array 905 (for example, in an ANC headset) provides fixed broadside beamforming at 910, in which a beamformed output is generated in the user look direction (i.e., in the direction the user of the steerable array is looking). A fixed microphone array 907 (such as in a smartphone) with an associated processor performs a DOA estimation in three dimensions around the fixed microphone array and separates the active speakers, at 920. The number of active speakers is determined, and a separate output for each active speaker (identified by an identification number for example) is generated. As with FIG. 18, a correlation or similarity is determined between the separated speakers of the fixed array and the outputs of the steerable arrays, at 950.

Continuing with FIG. 20, the determined correlation or similarity is used to increase the angular resolution near the DOAs of the active users, and a separation of the active speakers is again performed, at 960. Using the increased angular resolution and the outputs of the steerable arrays, another correlation or similarity measure is determined between the separated speakers of the fixed array and the outputs of the steerable arrays, at 970. This correlation or similarity measure may then be used to zoom into a target speaker, at 980.

It is a challenge to provide a method for estimating a three-dimensional direction of arrival (DOA) for each frame of an audio signal for concurrent multiple sound events that is sufficiently robust under background noise and reverberation. Robustness can be obtained by maximizing the number of reliable frequency bins. It may be desirable for such a method to be suitable for arbitrarily shaped microphone array geometry, such that specific constraints on microphone geometry may be avoided. A pair-wise 1-D approach as described herein can be appropriately incorporated into any geometry.

A solution may be implemented for such a generic speakerphone application or far-field application. Such an approach may be implemented to operate without a microphone placement constraint. Such an approach may also be implemented to track sources using available frequency bins up to the Nyquist frequency and down to a lower frequency (e.g., by supporting use of a microphone pair having a larger inter-microphone distance). Rather than being limited to a single pair for tracking, such an approach may be implemented to select a best pair among all available pairs. Such an approach may be used to support source tracking even in a far-field scenario, up to a distance of three to five meters or more, and to provide a much higher DOA resolution. Other potential features include obtaining an exact 2-D representation of an active source. For best results, it may be desirable that each source is a sparse broadband audio source, and that each frequency bin is mostly dominated by no more than one source.

For a signal received by a pair of microphones directly from a point source in a particular DOA, the phase delay differs for each frequency component and also depends on the spacing between the microphones. The observed value of the phase delay at a particular frequency bin may be calculated as the inverse tangent of the ratio of the imaginary term of the complex FFT coefficient to the real term of the complex FFT coefficient. As shown in FIG. 21, the phase delay value $\Delta\phi_f$ at a particular frequency f may be related to source DOA under a far-field (i.e., plane-wave) assumption as

${{\Delta \; \phi_{f}} = {2\pi \; f\; \frac{d\; \sin \; \theta}{c}}},$

where d denotes the distance between the microphones (in m), θ denotes the angle of arrival (in radians) relative to a direction that is orthogonal to the array axis, f denotes frequency (in Hz), and c denotes the speed of sound (in m/s). For the ideal case of a single point source with no reverberation, the ratio of phase delay to frequency Δφ/f will have the same value

$2\pi \; \frac{d\; \sin \; \theta}{c}$

over all frequencies.
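
A minimal sketch of this far-field model follows, assuming a microphone spacing d in meters and a speed of sound of 343 m/s; the function names are illustrative.

    import numpy as np

    C_SOUND = 343.0  # assumed speed of sound, m/s

    def phase_delay(f_hz: np.ndarray, d_m: float, theta_rad: float) -> np.ndarray:
        # Unwrapped phase delay (radians) at each frequency for source DOA theta,
        # per the far-field relation delta_phi = 2*pi*f*d*sin(theta)/c.
        return 2.0 * np.pi * f_hz * d_m * np.sin(theta_rad) / C_SOUND

    def wrap(phi: np.ndarray) -> np.ndarray:
        # Wrap phase into [-pi, pi), mirroring what an FFT-based measurement observes.
        return np.mod(phi + np.pi, 2.0 * np.pi) - np.pi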

Such an approach is limited in practice by the spatial aliasing frequency for the microphone pair, which may be defined as the frequency at which the wavelength of the signal is twice the distance d between the microphones. Spatial aliasing causes phase wrapping, which puts an upper limit on the range of frequencies that may be used to provide reliable phase delay measurements for a particular microphone pair. FIG. 23 shows plots of unwrapped phase delay vs. frequency for four different DOAs, and FIG. 24 shows plots of wrapped phase delay vs. frequency for the same DOAs, where the initial portion of each plot (i.e., until the first wrapping occurs) is shown in bold. Attempts to extend the useful frequency range of phase delay measurement by unwrapping the measured phase are typically unreliable.

Instead of phase unwrapping, a proposed approach compares the phase delay as measured (e.g., wrapped) with pre-calculated values of wrapped phase delay for each of an inventory of DOA candidates. FIG. 25 shows such an example that includes angle-vs.-frequency plots of the (noisy) measured phase delay values (gray) and the phase delay values for two DOA candidates of the inventory (solid and dashed lines), where phase is wrapped to the range of −π to π. The DOA candidate that is best matched to the signal as observed may then be determined by calculating, for each DOA candidate $\theta_i$, a corresponding error $e_i$ between the phase delay values $\Delta\phi_{i\_f}$ for the i-th DOA candidate and the observed phase delay values $\Delta\phi_{ob\_f}$ over a range of frequency components f, and identifying the DOA candidate value that corresponds to the minimum error. In one example, the error $e_i$ is expressed as $\lVert \Delta\phi_{ob\_f} - \Delta\phi_{i\_f} \rVert_f^2$, i.e., as the sum

$e_i = \sum_{f \in F} \left( \Delta\phi_{ob\_f} - \Delta\phi_{i\_f} \right)^2$

of the squared differences between the observed and candidate phase delay values over a desired range or other set F of frequency components. The phase delay values $\Delta\phi_{i\_f}$ for each DOA candidate $\theta_i$ may be calculated before run-time (e.g., during design or manufacture), according to known values of c and d and the desired range of frequency components f, and retrieved from storage during use of the device. Such a pre-calculated inventory may be configured to support a desired angular range and resolution (e.g., a uniform resolution, such as one, two, five, or ten degrees; or a desired nonuniform resolution) and a desired frequency range and resolution (which may also be uniform or nonuniform).
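
A hedged sketch of this inventory comparison follows; the uniform candidate grid and the direct squared difference of wrapped values are assumptions for illustration, and the names are not taken from the described system.

    import numpy as np

    C = 343.0  # assumed speed of sound, m/s

    def build_inventory(f_hz: np.ndarray, d_m: float,
                        thetas_rad: np.ndarray) -> np.ndarray:
        # Wrapped phase delay for each (DOA candidate, frequency bin);
        # in a deployed device this table would be computed before run-time.
        unwrapped = 2.0 * np.pi * np.outer(np.sin(thetas_rad), f_hz) * d_m / C
        return np.mod(unwrapped + np.pi, 2.0 * np.pi) - np.pi

    def candidate_errors(observed_wrapped: np.ndarray,
                         inventory: np.ndarray) -> np.ndarray:
        # e_i per candidate: sum over frequencies of the squared difference
        # between observed and pre-calculated wrapped phase delays.
        diff = inventory - observed_wrapped[None, :]
        return np.sum(diff ** 2, axis=1)

    # Frame estimate: thetas_rad[np.argmin(candidate_errors(obs, inventory))]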

It may be desirable to calculate the error $e_i$ across as many frequency bins as possible to increase robustness against noise. For example, it may be desirable for the error calculation to include terms from frequency bins that are beyond the spatial aliasing frequency. In a practical application, the maximum frequency bin may be limited by other factors, which may include available memory, computational complexity, strong reflection by a rigid body at high frequencies, etc.

A speech signal is typically sparse in the time-frequency domain. If the sources are disjoint in the frequency domain, then two sources can be tracked at the same time. If the sources are disjoint in the time domain, then two sources can be tracked at the same frequency. It may be desirable for the array to include a number of microphones that is at least equal to the number of different source directions to be distinguished at any one time. The microphones may be omnidirectional (e.g., as may be typical for a cellular telephone or a dedicated conferencing device) or directional (e.g., as may be typical for a device such as a set-top box).

Such multichannel processing is generally applicable, for example, to source tracking for speakerphone applications. Such a technique may be used to calculate a DOA estimate for a frame of the received multichannel signal. Such an approach may calculate, at each frequency bin, the error for each candidate angle with respect to the observed angle, which is indicated by the phase delay. The target angle at that frequency bin is the candidate having the minimum error. In one example, the error is then summed across the frequency bins to obtain a measure of likelihood for the candidate. In another example, one or more of the most frequently occurring target DOA candidates across all frequency bins is identified as the DOA estimate (or estimates) for a given frame.
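
The two per-frame aggregation strategies just described might be sketched as follows, assuming per_bin_error holds the error for each (candidate, frequency bin) pair for one frame; the array shapes and names are illustrative.

    import numpy as np

    def estimate_by_summed_error(per_bin_error: np.ndarray) -> int:
        # Sum the error across frequency bins; the frame estimate is the
        # candidate with the minimum total error.
        return int(np.argmin(per_bin_error.sum(axis=1)))

    def estimate_by_bin_votes(per_bin_error: np.ndarray) -> int:
        # Pick the minimum-error candidate at each bin, then take the most
        # frequently occurring candidate across all bins.
        votes = np.argmin(per_bin_error, axis=0)
        return int(np.bincount(votes).argmax())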

Such a method may be applied to obtain instantaneous tracking results (e.g., with a delay of less than one frame). The delay is dependent on the FFT size and the degree of overlap. For example, for a 512-point FFT with a 50% overlap and a sampling frequency of 16 kHz, the resulting 256-sample delay corresponds to sixteen milliseconds. Such a method may be used to support differentiation of source directions typically up to a source-array distance of two to three meters, or even up to five meters.

The error may also be considered as a variance (i.e., the degree to which the individual errors deviate from an expected value). Conversion of the time-domain received signal into the frequency domain (e.g., by applying an FFT) has the effect of averaging the spectrum in each bin. This averaging is even more obvious if a subband representation is used (e.g., mel scale or Bark scale). Additionally, it may be desirable to perform time-domain smoothing on the DOA estimates (e.g., by applying a recursive smoother, such as a first-order infinite-impulse-response filter).
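
A minimal sketch of such a first-order recursive smoother, with an assumed smoothing constant alpha:

    def smooth(prev: float, new: float, alpha: float = 0.8) -> float:
        # First-order IIR (exponential) smoothing of successive DOA estimates.
        return alpha * prev + (1.0 - alpha) * new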

It may be desirable to reduce the computational complexity of the error calculation operation (e.g., by using a search strategy, such as a binary tree, and/or applying known information, such as DOA candidate selections from one or more previous frames).

Even though the directional information may be measured in terms of phase delay, it is typically desired to obtain a result that indicates source DOA. Consequently, it may be desirable to calculate the error in terms of DOA rather than in terms of phase delay.

An expression of error $e_i$ in terms of DOA may be derived by assuming that an expression for the observed wrapped phase delay as a function of DOA, such as

${{\Psi_{f\; \_ \; {wr}}(\theta)} = {{{mod}\left( {{{{- 2}\pi \; f\; \frac{d\; \sin \; \theta}{c}} + \pi},{2\pi}} \right)} - \pi}},$

is equivalent to a corresponding expression for unwrapped phase delay as a function of DOA, such as

${{\Psi_{f\; \_ \; {un}}(\theta)} = {{- 2}\pi \; f\; \frac{d\; \sin \; \theta}{c}}},$

except near discontinuities that are due to phase wrapping. The error $e_i$ may then be expressed as

$e_i = \lVert \Psi_{f\_wr}(\theta_{ob}) - \Psi_{f\_wr}(\theta_i) \rVert_f^2 \equiv \lVert \Psi_{f\_un}(\theta_{ob}) - \Psi_{f\_un}(\theta_i) \rVert_f^2,$

where the difference between the observed and candidate phase delay at frequency f is expressed in terms of DOA as

${{\Psi_{f\; \_ \; {un}}\left( \theta_{ob} \right)} - {\Psi_{f\; \_ \; u\; n}\left( \theta_{i} \right)}} = {\frac{{- 2}\; \pi \; {fd}}{c}{\left( {{\sin \; \theta_{{ob}\; \_ \; f}} - {\sin \; \theta_{i}}} \right).}}$

A Taylor series expansion may be performed to obtain the following first-order approximation:

${{\frac{{- 2}\pi \; f\; d}{c}\left( {{\sin \; \theta_{{ob}\; \_ \; f}} - {\sin \; \theta_{i}}} \right)} \approx {\left( {\theta_{{ob}\; \_ \; f} - \theta_{i}} \right)\; \frac{{- 2}\pi \; {fd}}{c}\cos \; \theta_{i}}},$

which is used to obtain an expression of the difference between the DOA $\theta_{ob\_f}$ as observed at frequency f and DOA candidate $\theta_i$:

$\theta_{ob\_f} - \theta_i \cong \frac{\Psi_{f\_un}(\theta_{ob}) - \Psi_{f\_un}(\theta_i)}{\frac{2\pi f d}{c} \cos\theta_i}.$

This expression may be used, with the assumed equivalence of observed wrapped phase delay to unwrapped phase delay, to express error $e_i$ in terms of DOA:

$e_i = \lVert \theta_{ob} - \theta_i \rVert_f^2 \cong \left\lVert \frac{\Psi_{f\_wr}(\theta_{ob}) - \Psi_{f\_wr}(\theta_i)}{\frac{2\pi f d}{c} \cos\theta_i} \right\rVert_f^2,$

where the values of $[\Psi_{f\_wr}(\theta_{ob}), \Psi_{f\_wr}(\theta_i)]$ are defined as $[\Delta\phi_{ob\_f}, \Delta\phi_{i\_f}]$.

To avoid division by zero at the endfire directions (θ = ±90°), it may be desirable to perform such an expansion using a second-order approximation instead, as in the following:

${{\theta_{ob} - \theta_{i}}} \cong \left\{ {{{\begin{matrix}{{{{- C}/B}},} & {\theta_{i} = {0({broadside})}} \\{{\frac{{- B} + \sqrt{B^{2} - {4A\; C}}}{2A}},} & {{otherwise},}\end{matrix}{where}\mspace{14mu} A} = {\left( {\pi \; {fd}\; \sin \; \theta_{i}} \right)/c}},{B = {\left( {{- 2}\pi \; f\; d\; \cos \; \theta_{i}} \right)/c}},{{{and}C} = {- {\left( {{\Psi_{f\; \_ \; {un}}\left( \theta_{ob} \right)} - {\Psi_{f\; \_ \; {un}}\left( \theta_{i} \right)}} \right).}}}} \right.$

As in the first-order example above, this expression may be used, with the assumed equivalence of observed wrapped phase delay to unwrapped phase delay, to express error $e_i$ in terms of DOA as a function of the observed and candidate wrapped phase delay values.
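
A hedged sketch of this second-order evaluation follows, using the A, B, C coefficients defined above. The phase_diff argument stands in for $\Psi_{f\_un}(\theta_{ob}) - \Psi_{f\_un}(\theta_i)$ (with wrapped values substituted, as the text assumes), and the degeneracy tolerance is an illustrative choice.

    import math

    def doa_difference(phase_diff: float, f_hz: float, d_m: float,
                       theta_i_rad: float, c: float = 343.0) -> float:
        # Second-order approximation of theta_ob - theta_i from a phase-delay
        # difference, avoiding division by zero at the endfire directions.
        A = math.pi * f_hz * d_m * math.sin(theta_i_rad) / c
        B = -2.0 * math.pi * f_hz * d_m * math.cos(theta_i_rad) / c
        C = -phase_diff
        if abs(A) < 1e-12:  # theta_i = 0 (broadside): the quadratic degenerates
            return -C / B
        return (-B + math.sqrt(B * B - 4.0 * A * C)) / (2.0 * A)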

As shown in FIG. 27, a difference between observed and candidate DOA for a given frame of the received signal may be calculated in such a manner at each of a plurality of frequencies f of the received microphone signals (e.g., $\forall f \in F$) and for each of a plurality of DOA candidates $\theta_i$. As demonstrated in FIG. 28, a DOA estimate for a given frame may be determined by summing the squared differences for each candidate across all frequency bins in the frame to obtain the error $e_i$ and selecting the DOA candidate having the minimum error. Alternatively, as demonstrated in FIG. 29, such differences may be used to identify the best-matched (i.e., minimum squared difference) DOA candidate at each frequency. A DOA estimate for the frame may then be determined as the most frequent DOA across all frequency bins.

As shown in FIG. 31, an error term may be calculated for each candidate angle i and each of a set F of frequencies for each frame k. It may be desirable to indicate a likelihood of source activity in terms of a calculated DOA difference or error. One example of such a likelihood L may be expressed, for a particular frame, frequency, and angle, as

$\begin{matrix} {L\left( i,f,k \right) = \frac{1}{\left\| \theta_{ob} - \theta_{i} \right\|_{f,k}^{2}}.} & (1) \end{matrix}$

For expression (1), an extremely good match at a particular frequency may cause a corresponding likelihood to dominate all others. To reduce this susceptibility, it may be desirable to include a regularization term λ, as in the following expression:

$\begin{matrix} {L\left( i,f,k \right) = \frac{1}{\left\| \theta_{ob} - \theta_{i} \right\|_{f,k}^{2} + \lambda}.} & (2) \end{matrix}$

Speech tends to be sparse in both time and frequency, such that a sum over a set of frequencies F may include results from bins that are dominated by noise. It may be desirable to include a bias term β, as in the following expression:

$\begin{matrix} {L\left( i,f,k \right) = \frac{1}{\left\| \theta_{ob} - \theta_{i} \right\|_{f,k}^{2} + \lambda} - \beta.} & (3) \end{matrix}$

The bias term, which may vary over frequency and/or time, may be based on an assumed distribution of the noise (e.g., Gaussian). Additionally or alternatively, the bias term may be based on an initial estimate of the noise (e.g., from a noise-only initial frame). Additionally or alternatively, the bias term may be updated dynamically based on information from noise-only frames, as indicated, for example, by a voice activity detection module.

The frequency-specific likelihood results may be projected onto a (frame, angle) plane to obtain a DOA estimation per frame, θ_(est_k) = argmax_(i) Σ_(f∈F) L(i,f,k), that is robust to noise and reverberation because only target-dominant frequency bins contribute to the estimate. In this summation, terms in which the error is large have values that approach zero and thus become less significant to the estimate. If a directional source is dominant in some frequency bins, the error value at those frequency bins will be nearer to zero for that angle. Also, if another directional source is dominant in other frequency bins, the error value at the other frequency bins will be nearer to zero for the other angle.
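
A compact sketch of expression (3) and the (frame, angle) projection might look as follows. The array shapes and the default values of λ and β are assumptions made for illustration only.

    import numpy as np

    def likelihood(err_sq, lam=1e-2, beta=0.0):
        # err_sq: squared DOA difference, shape (I candidates, F bins), for one
        # frame k. Implements expression (3); beta=0 reduces it to expression (2).
        return 1.0 / (err_sq + lam) - beta

    def frame_angle_estimate(err_sq):
        # Project onto the (frame, angle) plane: sum likelihood over frequency
        # bins, then take the candidate index with the maximum summed value.
        L = likelihood(err_sq)
        return int(np.argmax(L.sum(axis=1)))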

The likelihood results may also be projected onto a (frame, frequency) plane to indicate likelihood information per frequency bin, based on directional membership (e.g., for voice activity detection). This likelihood may be used to indicate a likelihood of speech activity. Additionally or alternatively, such information may be used, for example, to support time- and/or frequency-selective masking of the received signal by classifying frames and/or frequency components according to their direction of arrival.

An anglogram representation is similar to a spectrogram representation. An anglogram may be obtained by plotting, at each frame, a likelihood of the current DOA candidate at each frequency.

A microphone pair having a large spacing is typically not suitable for high frequencies, because spatial aliasing begins at a low frequency for such a pair. A DOA estimation approach as described herein, however, allows the use of phase delay measurements beyond the frequency at which phase wrapping begins, and even up to the Nyquist frequency (i.e., half of the sampling rate). By relaxing the spatial aliasing constraint, such an approach enables the use of microphone pairs having larger inter-microphone spacings. As an array with a large inter-microphone distance typically provides better directivity at low frequencies than an array with a small inter-microphone distance, use of a larger array typically extends the range of useful phase delay measurements into lower frequencies as well.

The DOA estimation principles described herein may be extended to multiple microphone pairs in a linear array (e.g., as shown in FIG. 22). One example of such an application for a far-field scenario is a linear array of microphones arranged along the margin of a television or other large-format video display screen (e.g., as shown in FIG. 26). It may be desirable to configure such an array to have a nonuniform (e.g., logarithmic) spacing between microphones, as in the examples of FIGS. 22 and 26.

For a far-field source, the multiple microphone pairs of a linear array will have essentially the same DOA. Accordingly, one option is to estimate the DOA as an average of the DOA estimates from two or more pairs in the array. However, an averaging scheme may be affected by mismatch of even a single one of the pairs, which may reduce DOA estimation accuracy. Alternatively, it may be desirable to select, from among two or more pairs of microphones of the array, the best microphone pair for each frequency (e.g., the pair that gives the minimum error e_(i) at that frequency), such that different microphone pairs may be selected for different frequency bands. At the spatial aliasing frequency of a microphone pair, the error will be large. Consequently, such an approach will tend to automatically avoid a microphone pair when the frequency is close to its wrapping frequency, thus avoiding the related uncertainty in the DOA estimate. For higher-frequency bins, a pair having a shorter distance between the microphones will typically provide a better estimate and may be automatically favored, while for lower-frequency bins, a pair having a larger distance between the microphones will typically provide a better estimate and may be automatically favored. In the four-microphone example shown in FIG. 22, six different pairs of microphones are possible, i.e.,

$\begin{pmatrix} 4 \\ 2 \end{pmatrix} = 6.$

In one example, the best pair for each axis is selected by calculating, for each frequency f, P×I values, where P is the number of pairs, I is the size of the inventory, and each value e_(pi) is the squared absolute difference between the observed angle θ_(pf) (for pair p and frequency f) and the candidate angle θ_(if). For each frequency f, the pair p that corresponds to the lowest error value e_(pi) is selected. This error value also indicates the best DOA candidate θ_(i) at frequency f (as shown in FIG. 30).
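
The per-frequency pair selection just described can be written as a single joint argmin over a P×I error table per bin, as in this illustrative sketch; the tensor layout is an assumption of the sketch.

    import numpy as np

    def best_pair_and_candidate(err):
        # err: squared differences e_pi, shape (P pairs, I candidates, F bins).
        P, I, F = err.shape
        flat = err.reshape(P * I, F).argmin(axis=0)  # joint argmin per frequency bin
        return flat // I, flat % I                   # (best pair, best DOA candidate) per bin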

The signals received by a microphone pair may be processed as described herein to provide an estimated DOA, over a range of up to 180 degrees, with respect to the axis of the microphone pair. The desired angular span and resolution may be arbitrary within that range (e.g., uniform (linear) or nonuniform (nonlinear), limited to selected sectors of interest, etc.). Additionally or alternatively, the desired frequency span and resolution may be arbitrary (e.g., linear, logarithmic, mel-scale, Bark-scale, etc.).

In the model shown in FIG. 22, each DOA estimate between 0 and +/−90 degrees from a microphone pair indicates an angle relative to a plane that is orthogonal to the axis of the pair. Such an estimate describes a cone around the axis of the pair, and the actual direction of the source along the surface of this cone is indeterminate. For example, a DOA estimate from a single microphone pair does not indicate whether the source is in front of or behind the microphone pair. Therefore, while more than two microphones may be used in a linear array to improve DOA estimation performance across a range of frequencies, the range of DOA estimation supported by a linear array is typically limited to 180 degrees.

The DOA estimation principles described herein may also be extended to a two-dimensional (2-D) array of microphones. For example, a 2-D array may be used to extend the range of source DOA estimation up to a full 360° (e.g., providing a similar range as in applications such as radar and biomedical scanning). Such an array may be used in a speakerphone application, for example, to support good performance even for arbitrary placement of the telephone relative to one or more sources.

The multiple microphone pairs of a 2-D array typically will not share the same DOA, even for a far-field point source. For example, source height relative to the plane of the array (e.g., in the z-axis) may play an important role in 2-D tracking. FIG. 32 shows an example of a speakerphone application in which the x-y plane as defined by the microphone axes is parallel to a surface (e.g., a tabletop) on which the telephone is placed. In this example, the source is a person speaking from a location that is along the x axis but is offset in the direction of the z axis (e.g., the speaker's mouth is above the tabletop). With respect to the x-y plane as defined by the microphone array, the direction of the source is along the x axis, as shown in FIG. 32. The microphone pair along the y axis estimates a DOA of the source as zero degrees from the x-z plane. Due to the height of the speaker above the x-y plane, however, the microphone pair along the x axis estimates a DOA of the source as 30° from the x axis (i.e., 60 degrees from the y-z plane), rather than along the x axis. FIGS. 34 and 35 show two views of the cone of confusion associated with this DOA estimate, which causes an ambiguity in the estimated speaker direction with respect to the microphone axis.

An expression such as

$\begin{matrix} {\left\lbrack \tan^{-1}\left( \frac{\sin\theta_{1}}{\sin\theta_{2}} \right),\ \tan^{-1}\left( \frac{\sin\theta_{2}}{\sin\theta_{1}} \right) \right\rbrack,} & (4) \end{matrix}$

where θ₁ and θ₂ are the estimated DOA for pair 1 and 2, respectively, may be used to project all pairs of DOAs to a 360° range in the plane in which the three microphones are located. Such projection may be used to enable tracking directions of active speakers over a 360° range around the microphone array, regardless of height difference. Applying the expression above to project the DOA estimates (0°, 60°) of FIG. 32 into the x-y plane produces

${\left\lbrack {{\tan^{- 1}\left( \frac{\sin \; 0{^\circ}}{\sin \; 60{^\circ}} \right)},{\tan^{- 1}\left( \frac{\sin \; 60{^\circ}}{\sin \; 0{^\circ}} \right)}} \right\rbrack = \left( {{0{^\circ}},{90{^\circ}}} \right)},$

which may be mapped to a combined directional estimate (e.g., an azimuth) of 270° as shown in FIG. 33.
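
Expression (4) can be evaluated with a two-argument arctangent so that the signs of the observations are preserved, as in the sketch below. The function name and the use of atan2 (rather than a plain arctangent with separate sign handling) are implementation choices of this sketch, not features of the described method.

    import numpy as np

    def project_pair_doas(theta1, theta2):
        # Expression (4): angles in the microphone plane from two pairwise DOAs.
        s1, s2 = np.sin(theta1), np.sin(theta2)
        return np.arctan2(s1, s2), np.arctan2(s2, s1)

    # FIG. 32 example: observations (0 deg, 60 deg) project to (0 deg, 90 deg).
    a, b = project_pair_doas(np.deg2rad(0.0), np.deg2rad(60.0))
    print(np.rad2deg(a), np.rad2deg(b))   # 0.0 90.0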

In a typical use case, the source will be located in a direction that is not projected onto a microphone axis. FIGS. 37-40 show such an example in which the source is located above the plane of the microphones. In this example, the DOA of the source signal passes through the point (x,y,z)=(5,2,5). FIG. 37 shows the x-y plane as viewed from the +z direction, FIGS. 38 and 40 show the x-z plane as viewed from the direction of microphone MC30, and FIG. 39 shows the y-z plane as viewed from the direction of microphone MC10. The shaded area in FIG. 37 indicates the cone of confusion CY associated with the DOA θ₁ as observed by the y-axis microphone pair MC20-MC30, and the shaded area in FIG. 38 indicates the cone of confusion CX associated with the DOA θ₂ as observed by the x-axis microphone pair MC10-MC20. In FIG. 39, the shaded area indicates cone CY, and the dashed circle indicates the intersection of cone CX with a plane that passes through the source and is orthogonal to the x axis. The two dots on this circle that indicate its intersection with cone CY are the candidate locations of the source. Likewise, in FIG. 40 the shaded area indicates cone CX, the dashed circle indicates the intersection of cone CY with a plane that passes through the source and is orthogonal to the y axis, and the two dots on this circle that indicate its intersection with cone CX are the candidate locations of the source. It may be seen that in this 2-D case, an ambiguity remains with respect to whether the source is above or below the x-y plane.

For the example shown in FIGS. 37-40, the DOA observed by the x-axis microphone pair MC10-MC20 is θ₂ = tan⁻¹(−5/√(25+4)) ≈ −42.9°, and the DOA observed by the y-axis microphone pair MC20-MC30 is θ₁ = tan⁻¹(−2/√(25+25)) ≈ −15.8°. Using expression (4) to project these directions into the x-y plane produces the magnitudes (21.8°, 68.2°) of the desired angles relative to the x and y axes, respectively, which corresponds to the given source location (x,y,z) = (5,2,5). The signs of the observed angles indicate the x-y quadrant in which the source is located, as shown in FIG. 36.

In fact, almost 3-D information is given by a 2-D microphone array, except for the up-down confusion. For example, the directions of arrival observed by microphone pairs MC10-MC20 and MC20-MC30 may also be used to estimate the magnitude of the angle of elevation of the source relative to the x-y plane. If d denotes the vector from microphone MC20 to the source, then the lengths of the projections of vector d onto the x axis, the y axis, and the x-y plane may be expressed as d sin(θ₂), d sin(θ₁), and d√(sin²(θ₁)+sin²(θ₂)), respectively. The magnitude of the angle of elevation may then be estimated as θ̂_(h) = cos⁻¹√(sin²(θ₁)+sin²(θ₂)).
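
The elevation-magnitude estimate above reduces to a few lines, as sketched here; the clipping guard against rounding slightly above 1 is an assumption of this sketch.

    import numpy as np

    def elevation_magnitude(theta1, theta2):
        # |theta_h| = arccos(sqrt(sin^2(theta1) + sin^2(theta2)))
        r = np.hypot(np.sin(theta1), np.sin(theta2))
        return np.arccos(np.clip(r, 0.0, 1.0))

    # FIGS. 37-40 example: theta1 ~ -15.8 deg and theta2 ~ -42.9 deg give ~42.9 deg,
    # consistent with the source at (x, y, z) = (5, 2, 5).
    print(np.rad2deg(elevation_magnitude(np.deg2rad(-15.8), np.deg2rad(-42.9))))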

Although the microphone pairs in the particular examples of FIGS. 32-33 and 37-40 have orthogonal axes, it is noted that for microphone pairs having non-orthogonal axes, expression (4) may be used to project the DOA estimates to those non-orthogonal axes, and from that point it is straightforward to obtain a representation of the combined directional estimate with respect to orthogonal axes. FIG. 41 shows an example of microphone array MC10-MC20-MC30 in which the axis 1 of pair MC20-MC30 lies in the x-y plane and is skewed relative to the y axis by a skew angle θ₀.

FIG. 42 shows an example of obtaining a combined directional estimate in the x-y plane with respect to orthogonal axes x and y with observations (θ₁, θ₂) from an array as shown in FIG. 41. If d denotes the vector from microphone MC20 to the source, then the lengths of the projections of vector d onto the x axis and axis 1 may be expressed as d sin(θ₂) and d sin(θ₁), respectively. The vector (x,y) denotes the projection of vector d onto the x-y plane. The estimated value of x is known, and it remains to estimate the value of y.

The estimation of y may be performed using the projection p₁ = (d sin θ₁ sin θ₀, d sin θ₁ cos θ₀) of vector (x,y) onto axis 1. Observing that the difference between vector (x,y) and vector p₁ is orthogonal to p₁, calculate y as

$y = {d\; {\frac{{\sin \; \theta_{1}} - {\sin \; \theta_{2}\sin \; \theta_{0}}}{\cos \; \theta_{0}}.}}$

The desired angles of arrival in the x-y plane, relative to the orthogonal x and y axes, may then be expressed respectively as

$\left( {{\tan^{- 1}\left( \frac{y}{x} \right)},{\tan^{- 1}\left( \frac{x}{y} \right)}} \right) = {\begin{pmatrix}{{\tan^{- 1}\left( \frac{{\sin \; \theta_{1}} - {\sin \; \theta_{2}\sin \; \theta_{0}}}{\sin \; \theta_{2}\cos \; \theta_{0}} \right)},} \\{\tan^{- 1}\left( \frac{\sin \; \theta_{2}\cos \; \theta_{0}}{{\sin \; \theta_{1}} - {\sin \; \theta_{2}\sin \; \theta_{0}}} \right)}\end{pmatrix}.}$
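
For the skewed-pair geometry of FIGS. 41-42, the combined estimate follows directly from the two expressions above, as in this sketch; it assumes a unit length for d, since the common factor cancels inside the arctangents, and assumes θ₀ is not +/−90° so that cos θ₀ is nonzero.

    import numpy as np

    def combined_estimate_skewed(theta1, theta2, theta0):
        # theta0: skew of axis 1 relative to the y axis; |d| taken as 1.
        x = np.sin(theta2)
        y = (np.sin(theta1) - np.sin(theta2) * np.sin(theta0)) / np.cos(theta0)
        return np.arctan2(y, x), np.arctan2(x, y)   # angles relative to the x and y axes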

Extension of DOA estimation to a 2-D array is typically well-suited to and sufficient for a speakerphone application. However, further extension to an N-dimensional array is also possible and may be performed in a straightforward manner. For tracking applications in which one target is dominant, it may be desirable to select N pairs for representing N dimensions. Once a 2-D result is obtained with a particular microphone pair, another available pair can be utilized to increase degrees of freedom. For example, FIGS. 37-42 illustrate use of observed DOA estimates from different microphone pairs in the x-y plane to obtain an estimate of the source direction as projected into the x-y plane. In the same manner, observed DOA estimates from an x-axis microphone pair and a z-axis microphone pair (or other pairs in the x-z plane) may be used to obtain an estimate of the source direction as projected into the x-z plane, and likewise for the y-z plane or any other plane that intersects three or more of the microphones.

Estimates of DOA error from different dimensions may be used to obtain a combined likelihood estimate, for example, using an expression such as

$\frac{1}{\max\left( \left\| \theta - \theta_{0,1} \right\|_{f,1}^{2},\ \left\| \theta - \theta_{0,2} \right\|_{f,2}^{2} \right) + \lambda} \quad \text{or} \quad \frac{1}{\operatorname{mean}\left( \left\| \theta - \theta_{0,1} \right\|_{f,1}^{2},\ \left\| \theta - \theta_{0,2} \right\|_{f,2}^{2} \right) + \lambda},$

where θ_(0,i) denotes the DOA candidate selected for pair i. Use of the maximum among the different errors may be desirable to promote selection of an estimate that is close to the cones of confusion of both observations, in preference to an estimate that is close to only one of the cones of confusion and may thus indicate a false peak. Such a combined result may be used to obtain a (frame, angle) plane, as described herein, and/or a (frame, frequency) plot, as described herein.
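
A minimal sketch of the max-combined likelihood, under the same regularization assumption as expression (2), might be:

    import numpy as np

    def combined_likelihood(err1, err2, lam=1e-2):
        # err1, err2: squared DOA differences for pair 1 and pair 2. Taking the
        # max penalizes candidates that match only one cone of confusion.
        return 1.0 / (np.maximum(err1, err2) + lam)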

The DOA estimation principles described herein may be used to support selection among multiple speakers. For example, the location of multiple sources may be combined with a manual selection of a particular speaker (e.g., push a particular button to select a particular corresponding user) or automatic selection of a particular speaker (e.g., by speaker recognition). In one such application, a telephone is configured to recognize the voice of its owner and to automatically select a direction corresponding to that voice in preference to the directions of other sources.

A source DOA may be easily defined in 1-D, e.g., from −90° to +90°. For more than two microphones at arbitrary relative locations, it is proposed to use a straightforward extension of 1-D as described above, e.g., (θ₁, θ₂) in a two-pair case in 2-D, (θ₁, θ₂, θ₃) in a three-pair case in 3-D, etc.

A key problem is how to apply spatial filtering to such a combination of paired 1-D DOA estimates. In this case, a beamformer/null beamformer (BFNF) as shown in FIG. 43 may be applied by augmenting the steering vector for each pair. In this figure, A^(H) denotes the conjugate transpose of A, x denotes the microphone channels, and y denotes the spatially filtered channels. Using the pseudo-inverse operation A⁺ = (A^(H)A)⁻¹A^(H) as shown in FIG. 43 allows the use of a non-square matrix. For a three-microphone case (i.e., two microphone pairs) as illustrated in FIG. 45, for example, the number of rows is 2×2 = 4 instead of 3, such that the additional row makes the matrix non-square.

As the approach shown in FIG. 43 is based on robust 1-D DOA estimation, complete knowledge of the microphone geometry is not required, and DOA estimation using all microphones at the same time is also not required. Such an approach is well-suited for use with anglogram-based DOA estimation as described herein, although any other 1-D DOA estimation method can also be used. FIG. 44 shows an example of the BFNF as shown in FIG. 43 which also includes a normalization factor to prevent an ill-conditioned inversion at the spatial aliasing frequency.

FIG. 46 shows an example of a pair-wise (PW) normalized MVDR (minimum variance distortionless response) BFNF, in which the manner in which the steering vector (array manifold vector) is obtained differs from the conventional approach. In this case, a common channel is eliminated due to sharing of a microphone between the two pairs. The noise coherence matrix Γ may be obtained either by measurement or by theoretical calculation using a sinc function. It is noted that the examples of FIGS. 43, 44, and 46 may be generalized to an arbitrary number of sources N such that N ≤ M, where M is the number of microphones.

FIG. 47 shows another example that may be used if the matrix A^(H)A is not ill-conditioned, which may be determined using a condition number or determinant of the matrix. If the matrix is ill-conditioned, it may be desirable to bypass one microphone signal for that frequency bin for use as the source channel, while continuing to apply the method to spatially filter other frequency bins in which the matrix A^(H)A is not ill-conditioned. This option saves computation for calculating a denominator for normalization. The methods in FIGS. 43-47 demonstrate BFNF techniques that may be applied independently at each frequency bin. The steering vectors are constructed using the DOA estimates for each frequency and microphone pair as described herein. For example, each element of the steering vector for pair p and source n for DOA θ_(i), frequency f, and microphone number m (1 or 2) may be calculated as

${d_{p,m}^{n} = {\exp \left( {\frac{{- j}\; \omega \; {f_{s}\left( {m - 1} \right)}l_{p}}{c}\cos \; \theta_{i}} \right)}},$

where l_(p) indicates the distance between the microphones of pair p, ω indicates the frequency bin number, and f_(s) indicates the sampling frequency. FIG. 48 shows examples of steering vectors for an array as shown in FIG. 45.
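
One way to form the steering-vector elements just defined is sketched below. The argument names, the example values of f_s and c, and the use of the bin number ω directly as a scale factor mirror the expression above; beyond that, the interface is an assumption of this sketch.

    import numpy as np

    def steering_element(theta_i, w, m, l_p, fs=16000.0, c=343.0):
        # d_{p,m}^n for DOA theta_i, frequency-bin number w, microphone m
        # (1 or 2) of pair p with spacing l_p; fs and c are assumed values.
        return np.exp(-1j * w * fs * (m - 1) * l_p * np.cos(theta_i) / c)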

A PWBFNF scheme may be used for suppressing the direct path of interferers up to the available degrees of freedom (instantaneous suppression without a smooth-trajectory assumption, additional noise-suppression gain using directional masking, additional noise-suppression gain using bandwidth extension). Single-channel post-processing of the quadrant framework may be used for stationary noise and noise-reference handling.

It may be desirable to obtain instantaneous suppression but also to provide minimization of artifacts such as musical noise. It may be desirable to maximally use the available degrees of freedom for BFNF. One DOA may be fixed across all frequencies, or a slightly mismatched alignment across frequencies may be permitted. Only the current frame may be used, or a feed-forward network may be implemented. The BFNF may be set for all frequencies in the range up to the Nyquist rate (e.g., except ill-conditioned frequencies). A natural masking approach may be used (e.g., to obtain a smooth, natural, seamless transition of aggressiveness).

FIG. 49 shows a flowchart for one example of an integrated method as described herein. This method includes an inventory matching task for phase delay estimation, a variance calculation task to obtain DOA error variance values, a dimension-matching and/or pair-selection task, and a task to map the DOA error variance for the selected DOA candidate to a source activity likelihood estimate. The pair-wise DOA estimation results may also be used to track one or more active speakers, to perform a pair-wise spatial filtering operation, and/or to perform time- and/or frequency-selective masking. The activity likelihood estimation and/or spatial filtering operation may also be used to obtain a noise estimate to support a single-channel noise suppression operation.

The methods and apparatus disclosed herein may be applied generally in any transceiving and/or audio sensing application, especially mobile or otherwise portable instances of such applications. For example, the range of configurations disclosed herein includes communications devices that reside in a wireless telephony communication system configured to employ a code-division multiple-access (CDMA) over-the-air interface. Nevertheless, it would be understood by those skilled in the art that a method and apparatus having features as described herein may reside in any of the various communication systems employing a wide range of technologies known to those of skill in the art, such as systems employing Voice over IP (VoIP) over wired and/or wireless (e.g., CDMA, TDMA, FDMA, and/or TD-SCDMA) transmission channels.

It is expressly contemplated and hereby disclosed that communications devices disclosed herein may be adapted for use in networks that are packet-switched (for example, wired and/or wireless networks arranged to carry audio transmissions according to protocols such as VoIP) and/or circuit-switched. It is also expressly contemplated and hereby disclosed that communications devices disclosed herein may be adapted for use in narrowband coding systems (e.g., systems that encode an audio frequency range of about four or five kilohertz) and/or for use in wideband coding systems (e.g., systems that encode audio frequencies greater than five kilohertz), including whole-band wideband coding systems and split-band wideband coding systems.

Examples of codecs that may be used with, or adapted for use with, transmitters and/or receivers of communications devices as described herein include the Enhanced Variable Rate Codec, as described in the Third Generation Partnership Project 2 (3GPP2) document C.S0014-C, v1.0, entitled “Enhanced Variable Rate Codec, Speech Service Options 3, 68, and 70 for Wideband Spread Spectrum Digital Systems,” February 2007 (available online at www-dot-3gpp-dot-org); the Selectable Mode Vocoder speech codec, as described in the 3GPP2 document C.S0030-0, v3.0, entitled “Selectable Mode Vocoder (SMV) Service Option for Wideband Spread Spectrum Communication Systems,” January 2004 (available online at www-dot-3gpp-dot-org); the Adaptive Multi Rate (AMR) speech codec, as described in the document ETSI TS 126 092 V6.0.0 (European Telecommunications Standards Institute (ETSI), Sophia Antipolis Cedex, FR, December 2004); and the AMR Wideband speech codec, as described in the document ETSI TS 126 192 V6.0.0 (ETSI, December 2004). Such a codec may be used, for example, to recover the reproduced audio signal from a received wireless communications signal.

The presentation of the described configurations is provided to enable any person skilled in the art to make or use the methods and other structures disclosed herein. The flowcharts, block diagrams, and other structures shown and described herein are examples only, and other variants of these structures are also within the scope of the disclosure. Various modifications to these configurations are possible, and the generic principles presented herein may be applied to other configurations as well. Thus, the present disclosure is not intended to be limited to the configurations shown above but rather is to be accorded the widest scope consistent with the principles and novel features disclosed in any fashion herein, including in the attached claims as filed, which form a part of the original disclosure.

Those of skill in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, and symbols that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Important design requirements for implementation of a configuration as disclosed herein may include minimizing processing delay and/or computational complexity (typically measured in millions of instructions per second or MIPS), especially for computation-intensive applications, such as playback of compressed audio or audiovisual information (e.g., a file or stream encoded according to a compression format, such as one of the examples identified herein) or applications for wideband communications (e.g., voice communications at sampling rates higher than eight kilohertz, such as 12, 16, 32, 44.1, 48, or 192 kHz).

An apparatus as disclosed herein (e.g., any device configured to perform a technique as described herein) may be implemented in any combination of hardware with software, and/or with firmware, that is deemed suitable for the intended application. For example, the elements of such an apparatus may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips).

One or more elements of the various implementations of the apparatus disclosed herein may be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs (field-programmable gate arrays), ASSPs (application-specific standard products), and ASICs (application-specific integrated circuits). Any of the various elements of an implementation of an apparatus as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions, also called “processors”), and any two or more, or even all, of these elements may be implemented within the same such computer or computers.

A processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips). Examples of such arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs, FPGAs, ASSPs, and ASICs. A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors. It is possible for a processor as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to a procedure of an implementation described herein, such as a task relating to another operation of a device or system in which the processor is embedded (e.g., an audio sensing device). It is also possible for part of a method as disclosed herein to be performed by a processor of the audio sensing device and for another part of the method to be performed under the control of one or more other processors.

Those of skill will appreciate that the various illustrative modules, logical blocks, circuits, and tests and other operations described in connection with the configurations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Such modules, logical blocks, circuits, and operations may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an ASIC or ASSP, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to produce the configuration as disclosed herein. For example, such a configuration may be implemented at least in part as a hard-wired circuit, as a circuit configuration fabricated into an application-specific integrated circuit, or as a firmware program loaded into non-volatile storage or a software program loaded from or into a data storage medium as machine-readable code, such code being instructions executable by an array of logic elements such as a general purpose processor or other digital signal processing unit. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. A software module may reside in a non-transitory storage medium such as RAM (random-access memory), ROM (read-only memory), nonvolatile RAM (NVRAM) such as flash RAM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, hard disk, a removable disk, or a CD-ROM; or in any other form of storage medium known in the art. An illustrative storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

It is noted that the various methods disclosed herein may be performed by an array of logic elements such as a processor, and that the various elements of an apparatus as described herein may be implemented as modules designed to execute on such an array. As used herein, the term “module” or “sub-module” can refer to any method, apparatus, device, unit or computer-readable data storage medium that includes computer instructions (e.g., logical expressions) in software, hardware or firmware form. It is to be understood that multiple modules or systems can be combined into one module or system and one module or system can be separated into multiple modules or systems to perform the same functions. When implemented in software or other computer-executable instructions, the elements of a process are essentially the code segments to perform the related tasks, such as with routines, programs, objects, components, data structures, and the like. The term “software” should be understood to include source code, assembly language code, machine code, binary code, firmware, macrocode, microcode, any one or more sets or sequences of instructions executable by an array of logic elements, and any combination of such examples. The program or code segments can be stored in a processor-readable medium or transmitted by a computer data signal embodied in a carrier wave over a transmission medium or communication link.

Each of the tasks of the methods described herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. In a typical application of an implementation of a method as disclosed herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions), embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc.), that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP). For example, such a device may include RF circuitry configured to receive and/or transmit encoded frames.

It is expressly disclosed that the various methods disclosed herein may be performed by a portable communications device such as a handset, headset, or portable digital assistant (PDA), and that the various apparatus described herein may be included within such a device.

In one or more exemplary embodiments, the operations described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, such operations may be stored on or transmitted over a computer-readable medium as one or more instructions or code. The term “computer-readable media” includes both computer-readable storage media and communication (e.g., transmission) media. By way of example, and not limitation, computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM), or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices. Such storage media may store information in the form of instructions or data structures that can be accessed by a computer. Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, and/or microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology such as infrared, radio, and/or microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray Disc™ (Blu-Ray Disc Association, Universal City, Calif.), where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

An acoustic signal processing apparatus as described herein may be incorporated into an electronic device that accepts speech input in order to control certain operations, or may otherwise benefit from separation of desired noises from background noises, such as communications devices. Many applications may benefit from enhancing or separating clear desired sound from background sounds originating from multiple directions. Such applications may include human-machine interfaces in electronic or computing devices which incorporate capabilities such as voice recognition and detection, speech enhancement and separation, voice-activated control, and the like. It may be desirable to implement such an acoustic signal processing apparatus to be suitable in devices that only provide limited processing capabilities.

It is possible for one or more elements of an implementation of an apparatus as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to an operation of the apparatus, such as a task relating to another operation of a device or system in which the apparatus is embedded. It is also possible for one or more elements of an implementation of such an apparatus to have structure in common (e.g., a processor used to execute portions of code corresponding to different elements at different times, a set of instructions executed to perform tasks corresponding to different elements at different times, or an arrangement of electronic and/or optical devices performing operations for different elements at different times).

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include PCs, network servers, and handheld devices, for example.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

What is claimed:
1. A system which performs social interaction analysis for a plurality of participants, comprising: a processor configured to: determine a similarity between a first spatially filtered output and each of a plurality of second spatially filtered outputs, determine a social interaction between the participants based on the similarity between the first spatially filtered output and each of the second spatially filtered outputs, and display an output representative of the social interaction between the participants; wherein the first spatially filtered output is received from a fixed microphone array, and the second spatially filtered outputs are received from a plurality of steerable microphone arrays each corresponding to a different participant.
2. The system of claim 1, wherein the output is displayed in real-time as the participants are interacting with each other.
3. The system of claim 1, wherein the output comprises an interaction graph comprising: a plurality of identifiers, each identifier corresponding to a respective participant; and a plurality of indicators, each indicator providing information relating to at least one of: a participant looking at another participant, a strength of an interaction between two participants, a participation level of a participant, or a leader of a group of participants.
4. The system of claim 3, wherein the strength of the interaction between two participants is based on a time that the two participants have interacted.
5. The system of claim 3, wherein the indicators have at least one of a direction, a thickness, or a color, wherein the direction indicates which participant is looking at another participant, the thickness indicates the strength of the interaction between two participants, and the color indicates the leader of the group of participants.
6. The system of claim 3, wherein each of the participants is a speaker.
7. The system of claim 3, wherein the interaction graph is used to assess group dynamics or topic dynamics.
8. The system of claim 3, wherein the interaction graph indicates social interaction information among the participants.
9. The system of claim 8, wherein the social interaction information is accumulated over a period of time.
10. The system of claim 3, wherein the interaction graph is displayed on a smartphone.
11. The system of claim 3, wherein the interaction graph is displayed on at least one from among the group comprising a handset, a laptop, a tablet, a computer, and a netbook.
12. The system of claim 3, wherein each indicator represents active participant location and energy.
13. The system of claim 12, further comprising an additional indicator that represents a refined active participant location and energy.
14. The system of claim 12, wherein the indicators comprise beam patterns.
15. The system of claim 1, wherein the processor is further configured to perform real-time meeting analysis of a meeting the participants are participating in.
16. The system of claim 1, wherein the processor is further configured to generate a personal time line for a participant that shows an interaction history of the participant with respect to the other participants, a meeting topic, or a subject matter.
17. The system of claim 1, wherein the processor is further configured to generate participant interaction statistics over time.
18. The system of claim 1, wherein the processor is further configured to generate an evolution of interaction between participants over time.
19. The system of claim 1, wherein the processor is further configured to generate an interaction graph among the participants.
20. The system of claim 1, further comprising a user interface that is configured for collaboratively zooming into one of the participants in real-time.
21. A method for performing social interaction analysis for a plurality of participants, comprising: determining a similarity between a first spatially filtered output and each of a plurality of second spatially filtered outputs; determining a social interaction between the participants based on the similarity between the first spatially filtered output and each of the second spatially filtered outputs; and displaying an output representative of the social interaction between the participants; wherein the first spatially filtered output is received from a fixed microphone array, and the second spatially filtered outputs are received from a plurality of steerable microphone arrays each corresponding to a different participant.
22. The method of claim 21, further comprising displaying the output in real-time as the participants are interacting with each other.
23. The method of claim 21, wherein the output comprises an interaction graph comprising: a plurality of identifiers, each identifier corresponding to a respective participant; and a plurality of indicators, each indicator providing information relating to at least one of: a participant looking at another participant, a strength of an interaction between two participants, a participation level of a participant, or a leader of a group of participants.
24. The method of claim 23, wherein the strength of the interaction between two participants is based on a time that the two participants have interacted.
25. The method of claim 23, wherein the indicators have at least one of a direction, a thickness, or a color, wherein the direction indicates which participant is looking at another participant, the thickness indicates the strength of the interaction between two participants, and the color indicates the leader of the group of participants.
26. The method of claim 23, wherein each of the participants is a speaker.
27. The method of claim 23, further comprising using the interaction graph to assess group dynamics or topic dynamics.
28. The method of claim 23, wherein the interaction graph indicates social interaction information among the participants.
29. The method of claim 28, further comprising accumulating the social interaction information over a period of time.
30. The method of claim 23, further comprising displaying the interaction graph on a smartphone.
31. The method of claim 23, further comprising displaying the interaction graph on at least one from among the group comprising a handset, a laptop, a tablet, a computer, and a netbook.
32. The method of claim 23, wherein each indicator represents active participant location and energy.
33. The method of claim 23, further comprising an additional indicator that represents a refined active participant location and energy.
34. The method of claim 23, wherein the indicators comprise beam patterns.
35. The method of claim 21, further comprising performing real-time meeting analysis of a meeting the participants are participating in.
36. The method of claim 21, further comprising generating a personal time line for a participant that shows an interaction history of the participant with respect to other participants, a meeting topic, or a subject matter.
37. The method of claim 21, further comprising generating participant interaction statistics over time.
38. The method of claim 21, further comprising generating an evolution of interaction between participants over time.
39. The method of claim 21, further comprising generating an interaction graph among the participants.
40. The method of claim 21, further comprising collaboratively zooming into one of the participants in real-time.
41. An apparatus for performing social interaction analysis for a plurality of participants, comprising: means for determining a similarity between a first spatially filtered output and each of a plurality of second spatially filtered outputs; means for determining a social interaction between the participants based on the similarity between the first spatially filtered output and each of the second spatially filtered outputs; and means for displaying an output representative of the social interaction between the participants; wherein the first spatially filtered output is received from a fixed microphone array, and the second spatially filtered outputs are received from a plurality of steerable microphone arrays each corresponding to a different participant.
42. The apparatus of claim 41, further comprising means for displaying the output in real-time as the participants are interacting with each other.
43. The apparatus of claim 41, wherein the output comprises an interaction graph comprising: a plurality of identifiers, each identifier corresponding to a respective participant; and a plurality of indicators, each indicator providing information relating to at least one of: a participant looking at another participant, a strength of an interaction between two participants, a participation level of a participant, or a leader of a group of participants.
44. The apparatus of claim 43, wherein the strength of the interaction between two participants is based on a time that the two participants have interacted.
45. The apparatus of claim 43, wherein the indicators have at least one of a direction, a thickness, or a color, wherein the direction indicates which participant is looking at another participant, the thickness indicates the strength of the interaction between two participants, and the color indicates the leader of the group of participants.
46. The apparatus of claim 43, wherein each of the participants is a speaker.
47. The apparatus of claim 43, further comprising means for using the interaction graph to assess group dynamics or topic dynamics.
48. The apparatus of claim 43, wherein the interaction graph indicates social interaction information among the participants.
49. The apparatus of claim 48, further comprising means for accumulating the social interaction information over a period of time.
50. The apparatus of claim 43, further comprising means for displaying the interaction graph on a smartphone.
51. The apparatus of claim 43, further comprising means for displaying the interaction graph on at least one from among the group comprising a handset, a laptop, a tablet, a computer, and a netbook.
52. The apparatus of claim 43, wherein each indicator represents active participant location and energy.
53. The apparatus of claim 52, further comprising an additional indicator that represents a refined active participant location and energy.
54. The apparatus of claim 52, wherein the indicators comprise beam patterns.
55. The apparatus of claim 41, further comprising means for performing real-time meeting analysis of a meeting the participants are participating in.
56. The apparatus of claim 41, further comprising means for generating a personal time line for a participant that shows an interaction history of the participant with respect to other participants, a meeting topic, or a subject matter.
57. The apparatus of claim 41, further comprising means for generating participant interaction statistics over time.
58. The apparatus of claim 41, further comprising means for generating an evolution of interaction between participants over time.
59. The apparatus of claim 41, further comprising means for generating an interaction graph among the participants.
60. The apparatus of claim 41, further comprising means for collaboratively zooming into one of the participants in real-time.
61. A non-transitory computer-readable medium comprising computer-readable instructions for causing a processor to: determine a similarity between a first spatially filtered output and each of a plurality of second spatially filtered outputs; determine a social interaction between a plurality of participants based on the similarity between the first spatially filtered output and each of the second spatially filtered outputs; and display an output representative of the social interaction between the plurality of participants; wherein the first spatially filtered output is received from a fixed microphone array, and the second spatially filtered outputs are received from a plurality of steerable microphone arrays each corresponding to a different participant.
62. The computer-readable medium of claim 61, further comprising instructions for causing the processor to display the output in real-time as the participants are interacting with each other.
63. The computer-readable medium of claim 61, wherein the output comprises an interaction graph comprising: a plurality of identifiers, each identifier corresponding to a respective participant; and a plurality of indicators, each indicator providing information relating to at least one of: a participant looking at another participant, a strength of an interaction between two participants, a participation level of a participant, or a leader of a group of participants.
64. The computer-readable medium of claim 63, wherein the strength of the interaction between two participants is based on a time that the two participants have interacted.
65. The computer-readable medium of claim 63, wherein the indicators have at least one of a direction, a thickness, or a color, wherein the direction indicates which participant is looking at another participant, the thickness indicates the strength of the interaction between two participants, and the color indicates the leader of the group of participants.
66. The computer-readable medium of claim 63, wherein each of the participants is a speaker.
67. The computer-readable medium of claim 63, further comprising instructions for causing the processor to use the interaction graph to assess group dynamics or topic dynamics.
68. The computer-readable medium of claim 63, wherein the interaction graph indicates social interaction information among the participants.
69. The computer-readable medium of claim 68, further comprising instructions for causing the processor to accumulate the social interaction information over a period of time.
70. The computer-readable medium of claim 63, further comprising instructions for causing the processor to display the interaction graph on a smartphone.
71. The computer-readable medium of claim 63, further comprising instructions for causing the processor to display the interaction graph on at least one from among the group comprising a handset, a laptop, a tablet, a computer, and a netbook.
72. The computer-readable medium of claim 63, wherein each indicator represents active participant location and energy.
73. The computer-readable medium of claim 72, further comprising an additional indicator that represents a refined active participant location and energy.
74. The computer-readable medium of claim 72, wherein the indicators comprise beam patterns.
75. The computer-readable medium of claim 61, further comprising instructions for causing the processor to perform real-time meeting analysis of a meeting the participants are participating in.
76. The computer-readable medium of claim 61, further comprising instructions for causing the processor to generate a personal time line for a participant that shows an interaction history of the participant with respect to other participants, a meeting topic, or a subject matter.
77. The computer-readable medium of claim 61, further comprising instructions for causing the processor to generate participant interaction statistics over time.
78. The computer-readable medium of claim 61, further comprising instructions for causing the processor to generate an evolution of interaction between participants over time.
79. The computer-readable medium of claim 61, further comprising instructions for causing the processor to generate an interaction graph among the participants.
80. The computer-readable medium of claim 61, further comprising instructions for causing the processor to collaboratively zoom into one of the participants in real-time.